{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Lab 25 - Hypothesis testing with multiple categories\n", "\n", "We will learn how to test hypotheses when our categorical data has more than two categories.\n", "\n", "First, let's import the necessary libraries." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "import numpy as np\n", "import matplotlib.pyplot as plt\n", "import pandas as pd\n", "%matplotlib inline" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Jury Panels in Alameda Country 2009-2010\n", "\n", "We will look at jury panel data from 2009 and 2010 collected by the American Civil Liberties Union (ACLU). The total number of people who reported for jury duty in those years was 1,452. See [11.2 Multiple Categories](https://www.inferentialthinking.com/chapters/11/2/Multiple_Categories.html) for more information.\n", "\n", "We can create a dataframe with this data as shown below. " ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "# create a dictionary listing each column, followed by the column values in a list\n", "jury_data = {\"Eligible\":[0.15, 0.18, 0.12, 0.54, 0.01],\n", " \"Panels\":[0.26, 0.08, 0.08, 0.54, 0.04]}\n", "# pass the dictionary into the dataframe creation function as a parameter\n", "# also pass in labels for the rows\n", "jury = pd.DataFrame(data = jury_data, index = [\"Asian\",\"Black\",\"Latino/a\",\"White\",\"Other\"])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "What do you think are the columns and rows of the new dataframe? Check your answer by displaying the dataframe `jury`. " ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We will now make a bar chart of the two columns of data. Because our dataframe `jury` already contains counts of the different categories, we do not have to use the `value_count()` function. Instead, type `jury.plot(kind = \"bar\")` below." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": { "collapsed": true }, "source": [ "How does the distribution of eligible jurors compare with the distribution of the jury panels?\n", "\n", "To know whether the variation between the distributions is just the result of chance, we can compute a random sample from the eligible distribution and compare it with the panel distribution.\n", "\n", "First we create a variable for the eligible population and distribution." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "population = [\"Asian\",\"Black\",\"Latinx\",\"White\",\"Other\"]\n", "pop_prob = [0.15, 0.18, 0.12, 0.54, 0.01]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can create a sample from this population. The sample should be the same size as our data." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "sample = np.random.choice(population,p = pop_prob,size = 1453)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Compute the value counts for the sample (you will have to make it a Pandas Series first). Since we only have the probabilities of the eligible population, we want to compute the value counts as probabilities as well. 
We can do this by adding the parameter `normalize = True`. Save the probabilities of the sample in the variable `sample_probs`." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
Answer:\n", "\n", "    sample_probs = pd.Series(sample).value_counts(normalize = True)\n", "\n", "
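Note: `value_counts` orders its result by how often each value occurs, so the groups in `sample_probs` may not appear in the same order as the rows of `jury`. That is fine for the next step, because pandas matches values to rows by label when a Series is assigned to a dataframe column; it only requires that the labels in `population` exactly match the row labels of `jury`. If you want to view the sample probabilities in the same row order as `jury` (an optional check), you can try:\n", "\n", "    sample_probs.reindex(jury.index)\n", "\n", "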
\n", "\n", "Next we will create a new column in our dataframe called `Random` that contains the probabilities from our random sample. To do this, type `jury[\"Random\"] = sample_probs` below and run it." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Display the `jury` dataframe again to check that the column was added." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Plot the bar chart of the dataframe again, and the new column will be included." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "How does the distribution of the random sample compare to the eligible distribution? To the panel distribution?\n", "\n", "Let's compare the panels distribution to the eligible distribution quantitatively using hypothesis testing. We need to choose a statistic to simulate, called the *test statistic*. In this problem, we will use something called the *Total Variation Distance (TVD)* as the test statistic. The TVD measures the difference between two distributions. \n", "\n", "First we will compute the TVD between the panels and eligible distribution:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "jury[\"Difference\"] = jury[\"Panels\"] - jury[\"Eligible\"]\n", "jury" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "What is the sum of the difference column? \n", "\n", "To fix this, we will take the absolute differences between probabilities." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "jury[\"Absolute Difference\"] = np.abs(jury[\"Difference\"])\n", "jury" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "What does this do?\n", "\n", "Now take the sum of the absolute difference column. You can use the same command as when we took the sum of a filter." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "jury[\"Absolute Difference\"].sum()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
Answer:\n", "\n", "    jury[\"Absolute Difference\"].sum()\n", "\n", "
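Using the numbers above, the differences (Panels minus Eligible) are 0.11, -0.10, -0.04, 0.00, and 0.03, so the absolute differences sum to 0.28.\n", "\n", "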
\n", "\n", "Notice this sum is twice either the positive or negative count, so we divide it by two. This quantity is the *total variation distance (TVD)* between the distribution of ethnicity in the eligible juror population and the panel." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "data_tvd = jury[\"Absolute Difference\"].sum()/2\n", "data_tvd" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We could have done this calculation in one line of code:\n", "`np.abs(jury[\"Panel\"] - jury[\"Eligible\"]).sum()/2`\n", "Try it below." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Next we want to understand the distribution of the test statistic (here, the TVD) if the panels were actually from the eligible distribution. To do this, we want to simulate random samples from the eligible distribution and compute the total variation distance between the sample and eligible distribution.\n", "\n", "First compute for one random sample, and compute the TVD between its probabilities and the eligible distribution." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now we want to repeat this process many times, and make a histogram of the difference TVD values. First, use a loop to generate many samples and compute the TVD to the eligible population." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Next make the histogram of these simulated test statistics (the TVDs)." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Does the test statistic computed from the data look like it comes from this distribution?" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.6.3" } }, "nbformat": 4, "nbformat_minor": 2 }